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ABSTRACT 

Recent literature provides many computational and modeling approaches for 
covariance matrices estimation in a penalized Gaussian graphical models but 
relatively little study has been carried out on the choice of the tuning param- 
eter. This paper tries to fill this gap by focusing on the problem of shrink- 
age parameter selection when estimating sparse precision matrices using the 
penalized likelihood approach. Previous approaches typically used K-fold 
cross-validation in this regard. In this paper, we first derived the generalized 
approximate cross-validation for tuning parameter selection which is not only 
a more computationally efficient alternative, but also achieves smaller error 



rate for model fitting compared to leave-one-out cross-validation. For consis- 
tency in the selection of nonzero entries in the precision matrix, we employ a 
Bayesian information criterion which provably can identify the nonzero con- 
ditional correlations in the Gaussian model. Our simulations demonstrate 
the general superiority of the two proposed selectors in comparison with 
leave-one-out cross-validation, ten-fold cross-validation and Akaike informa- 
tion criterion. 

Keywords: Adaptive Lasso; BIC; Generahzed approximate cross-validation; 
Precision matrix; SCAD penalty. 

1 INTRODUCTION 

Undirected graphical models provide an easily understood way of describing 
and visualizing the relationships between multiple random variables. Usually 
the goal is to seek the simplest model that adequately explains the observa- 
tions. In the Gaussian case, the zero entries in the inverse of the covariance 
matrix, or the precision matrix, correspond to independence of those two 
random variables conditional on all others. 

Efficient estimation of a covariance matrix or its inverse is an important 
statistical problem. In network modeling, for example, correct identification 
of the nonzero entries provides an understanding of the relationships between 
the expression levels of different genes. There also exists an abundance of 
methods in multivariate analysis that requires the estimation of the covari- 
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ance matrix or its inverse, such as principal component analysis (PCA) or 
linear discriminant analysis (LDA). 

Due to the recent surge in interests in the estimation of the covariance 
matrix with the appearance of a large amount of data generated by mod- 
ern experimental advances such as microarrays, a large number of different 
estimators have been proposed by now. Mostly motivated by the fact that 
the sample covariance matrix is typically a noisy estimator when the sample 
size is not significant and the resulting dense matrix is difficult to interpret, 
almost all modern estimators regularize the matrix by making it sparse. This 
notion of sparsity in the context of esti mating covarianc e matrices has been 



Dempsted (119721 ) who simplified the 



noticed by some author as early as in 
matrix structure by setting some entries to zero. 

Traditionally, for the identification of zero elements, forward-selection or 
backward-selection is performed with each element tested at each step. How- 



ever, it is con aputational 



of variables. 



y infea sible even for data with a moderate number 



Li and Guil (120061 ) proposed using threshold gradient descent 



for estimating the 



sparse precision matrix in the Gaussian graphical models. 



Bickel and Levinal (120081 ) developed asymptotic theories on banding methods 
for both covariance and precision matrix estimation based on direct thresh- 
olding of the elements. Another way to estimate the graphical model is to 
perform a regress i on for each variable on all the remaining ones. For example. 



Dobra and West 



(120041 ) used a Bayesian approach that employs a stochastic 
algorithm which can deal with tens of thousands of variables. 

Recently, there has been much interest on the estimation of sparse covari- 



anceor precision matrices using penalized likelihood method. iMeinshausen and Buhlmann 



(I2OO6I ) estimates the conditional independence separately for each random 
variable using the Lasso penalty. Note that setting up separate regressions 
for each node does not result in a valid likelihood for the covariance ma- 
trix and thus in order to obtain a positive-defi nite estimator, some post- 



Yuan and LinI (120071 ) used 



processing is typically performed as the last step, 
semi-definite programming algorithms to achieve sparsity by penalizing the 
normal likelihood with Lasso penalty on the elements of the precision matrix. 

The Lasso penalty will force some of the coefficients to be exactly zero. 
Compared to traditional model selection methods using information crite- 



ria, Lasso is continuous and thus more stable. 



iowever, several authors 



(IMeinshausen and Buhlmann 



2006 



Zhao and Yu 



20061 ) have noted that 



Lasso is in general not consistent for model selection unless some nontrivial 
conditions on the covariates are satisfied. Even when those conditions are in- 
deed satisfied, the efficiency of the estimator is compromised when one insists 
on variable selection consistency si nce the coefficients are over-shrunk. To ad- 
dress these shortcomings of Lasso, iFan and Lil (120011 ) proposed the smoothly 



clipped absolute deviation (SCAD) penalty which takes into account several 
desired properties of the estimator such as sparsity, continuity, asymptotic 
unbiasedness. They also show that the resulting estimator possesses the or- 
acle property, i.e. it is consistent for variable selection and behaves the same 
as when the zero coefficients are known in advance. These results are ex- 
tended t o th e case with a diverging number of covariates in 



(|2004h 



Fan and Peng 



Zoul (I2OO6I ) proposed adaptive Lasso using a weighted Li penalty 



with weights determined by an initial estimator and similar oracle property 
followed. The idea behind the adaptive lasso is to assign higher penalty for 
zero coefficients and lower penalty for larger co efficients. 
In the context of Gaussian graphical models, 



Lam and Fan 



(120071 ) studied 



rates of convergence of sparse covariance/precision matrices estimation via 
a general penalty function including SCAD and adaptive Lasso penalties as 
special cases and showed that these penalt y function s atten uated the bias 



Fan et al. 



fl2008h . through local 



problem associated with Lasso penalty. In 
linear approximation to the SCAD penalty function, efficient computation 
of the penalized likeliho od is achieved using the graphic al Lasso algorithn i 



of 



Friedman et al. 



(120081 ). Oracle properties as d efined in 



Fa 



Fan et al 



n and Lil (1200 ll ) 



(12008h . 



were shown for the precision matrix estimator in 

The attractive oracle property of the penalized estimator depends criti- 
cally on the appropriate choice of the tuning parameter. For penalized likeli- 
hood method, most of the studies mentioned above employed cross-validation 
(CV) for the selection of the tuning parameter. Cross-validation requires fit- 
ting the model based on different subsets of the observations multiple times, 
which increased the computational complexity of this a pproach. As an ap 
proxim ation to leave-one-out cross-validation (LOQCV). ICraven and Wahbc 
( 119791 ) proposed generalized cross-validation (GCV) for smoothing spline, 
and further derived generalized approximat e cross-validation (GACV) for 



non-Gaussian data (IDong and Wahba 



19961 ). We will follow similar strate- 



gies and derive GACV for the Gaussian graphical model that can be com- 
puted efficiently without multiple fitting of the model. However, for linear 



regression, iWang et al.l (120071 ) showed that generahzed cross vahdation can- 
not select the tuning parameter satisfactorily. In particular, the tuning pa- 
rameter chosen by GCV tend to overestimate the model size. Because of the 
asymptotic equivalence of generalized cross-validation, leave-one-out cross- 
validation and Akaike information criterion (AIC) in this simple model, the 
authors proposed to use Bayesian information criterion (BIG) for consistent 
model selection. We will demonstrate that similar conclusions can be reached 
for our problem of precision matrices estimation. 

In this paper, we es timate the precis ion matrices using the same compu- 



tational approach as in 



Fan et al. 



(120081 ). However, we focus on the problem 



of tuning parameter selection in penalized Gaussian graphical model. In the 
next section, two selectors, GAGV and BIG, are proposed in this context. 
Simulation studies are carried out in Section 3 which demonstrate the supe- 
rior performance of the proposed selectors. Finally, some remarks conclude 
the article in Section 4. 

2 PENALIZED ESTIMATION AND TUN- 
ING PARAMETER SELECTION 

Suppose xi, . . . ,x„ are i.i.d. observations generated from a p— dimensional 
multivariate Gaussian distribution with mean and unknown covariance 
matrix Sq, where Xj = {xn, . . . ,Xip)'^. Denote the sample covariance matrix 
by S = Xir=i l"^- The inverse of S is a natural estimator of the precision 
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matrix Qq = Sq ^ which is the estimator that maximizes the Gaussian log- 
hkehhood 

max log — Tr(Sri), 

where |r2| is the determinant of the matrix fl. 

However, if the true precision matrix is known to be sparse or the di- 
mension of the random vector is large, the performance of the estimator can 
typically be improved by maximizing the penalized log-likelihood instead: 



max log I n I - Tr (Srj) - ^ ^ px^^ {\ujij\), 



i=i j=i 



where tUij^i = 1, . . . ,p, j = 1, . . . ,p, are the entries of d and Xij are the 
tuning parameters, which for now are left unspecified to allow for very general 
penalty function s. Note that we also penalize the diagonal elements as in 



Fan et al. 



(120081). 

In the above, using the penalty Xij = \,px{\x\) = X\x\ produces the 
Lasso estimator, while using X^j = X/\ijJij\'^, where 7 > and Uij is any 
consistent estimator of uJij, paired with the same pA(|a^|) = X\x\ produces 
the adaptiv e Lasso estimator . Another commonly used penalty function 



proposed by iFan and Lil (1200 ll ) is the SCAD penalty defined by its derivative 



pxi\x\) = XIi\x\ < A) + > A), 

(a - 1) 



with a = 3.7 according to the suggestion made in 



Fan and Lil (1200 ih . Unlike 



the Lasso estimator, which produces substantial biases for large elements. 
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the adaptive Lass o as well as the SC AD estimator achieved the oracle prop- 



erty as defined in iFan and Lil (|2001| ). which estimates the precision matrix 
as well as in the ideal situation where the zero elements are known. Efficient 
maximization with either Lasso or adaptive La sso penalty can be dir ectly 



performed using the graphical Lasso algorit hm (IFriedman et al. 



Fan et al. 



20081). To 



(120081 ) suggested us- 



take advantage of graphical Lasso algorithm, 
ing local linear approximation to recast the SCAD penalized likelihood as 
weighted Lasso in each step. It was pointed out that a one-step algorithm 
can perform as well as the f ully iterative loca l linear approximation algorithm. 



The reader are referred to 



Fan et al. 



(120081 ) for further details. 



Here we focus on tuning parameter selection which consists of a single 



parameter A for all three penalty functions mentioned above. In 



Fan et al. 



(120081 ). it was shown that for both adaptive Lasso penalty and the SCAD 



penalty, when A ^ and y/n\ oo, the resulting esti mators attain th e 



Fan et al. 



feoosl), 



oracle property. Hence, the choice of A is critical. In 
they proposed to use K-fold cross-validation (KCV), with = 10 probably 
the most popular choice in the literature. In K-fold cross-validation, one 
partitions the data into K disjoint subsets and chooses A that maximizes the 
score 

K 

KCV{\) = Y^rik (log|ri(-'=)(A)| -Tr(S(-'=)^2(-'=)(A))) , 

k=l 

where Uk is the sample size of the k—th partition, S'-"'^-' is the sample covari- 
ance matrix based on the data with k~th partition excluded, and f2(~'^)(A) 
is the estimate of the precision matrix based on the data excluding the k—th 
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partition with A as the tuning parameter. 

The usual leave-one-out cross-validation (LOOCV) is just a special case 
of the above with K = n so that each partition consists of one single obser- 
vation. Computation of KCV entails K maximization problems fitted with 
each partition deleted in turn for each fixed A as well as a final maximiza- 
tion using the optimal tuning parameter. Thus the computation of KCV 
is infeasible for large datasets especially when K is large. In the smoothing 
spline literature, the most popular approach for tuning parameter selection is 
the g eneralized cross-val i datio n (GCV) for Gaussian data. For non-Gaussian 



data, 



Dong and Wahbal (119961 ) proposed the generalized approximate cross- 



validation (GACV), which is obtained by constructing an approximation to 
the LOOCV based on the log-likelihood. The formula presented there does 
not directly apply to our problem since their derivation is based on regression 
problems. We also derive a GACV score based on an approximation to the 
LOOCV for our problem. The derivation of GACV is complicated by the 
multivariate nature of each left-out observation. The detail is deferred to 
Appendix A. Maybe surpringsingly, even though GACV is motivated by an 
efficient approximation to LOOCV, it almost always performs better than 
LOOCV and sometimes better than ten-fold CV as demonstrated by the 
simulation studies in the next section. 

In classical model selection literature, it is well understood that CV, GCV 



and AIC share similar asymptotic propertie s. All these criteria are 



ficient but inconsistent for model selection (IShao 



1997 



Yang 



OSS ef- 



20051). For 



penalized linear regression with the SCAD penalty, it has been verified by 



Wang et al.l (120071 ) that GCV is not able to identify the true model con- 
sistently when it is used fo r tuning parameter selection. To address this 



problem, 



Wang et al 



(120071 ) employed a variable selection criterion known 
to be consistent in the classical literature, BIG, as the tuning parameter se- 
lector and proved that the resulting tuning parameter can identify the true 
model consistent l y. Sim ilar conclusion h a s been drawn for adaptive Lasso 



(IWang and Leng 



2003). In 



Wang et al. 



(120081 ). the theory is further ex- 



tended to linear regression problems with a diverging number of parameters. 
In our context, with BIG, we select the optimal A by minimizing 



BIC{X) = -log|^2(A)| +Tr(Sf2(A)) + A; 



logra 



n 



where k is the number of nonzero entries in the upper diagonal portion of 
the estimated precision matrix ri(A). 

We conjecture that our GAGV proposed above is also not appropriate for 
model selection, although a formal proof seems illusive due to the complicated 
form of the GAGV score. Nevertheless, we can extend the consistency proof 
for BIG using empirical process theory to show that the tuning parameter 
selected by BIG will result in consistent model identification. The result is 
stated below while the proof is relegated to Appendix B. 

Theorem 1 Denoting the optimal tuning parameter selected using BIC by 
^Bic, if Conditions 1-3 in Appendix B hold, then the penalized likelihood es- 
timator correctly identifies all the zero elements in the true precision matrix. 



That is, pr{S 



^BIC 



St) I, where Sx = {{hj) ■ i < 7^ 0} is 
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the set of nonzero entries above and including the diagonal in the estimated 
precision matrix and St is similarly defined to be the set of nonzero entries 
in the true precision matrix Q,^. 

The conditions imposed on the penalty function can be verified for both 
SCAD penahy and adaptive Lasso penalty. Thus for these two penalty func- 
tions, BIG identifies the correct model for the precision matrix. 



3 SIMULATIONS 

In this section we compare the performance of different tuning parameter 
selectors for Lasso, adaptive Lasso and SCAD penalty estimators in Gaussian 
graphical models. For adaptive Lasso penalty with Aj-,- = X/lCjijl'^ , we use the 
Lasso estimator as the initial consistent estimator and set 7 = 0.5 following 



Fan et al. 



fl2008h . Besides CV, KCV (ten-fold), GACV and BIG, we also use 



AIG as the tuning parameter selector, which is defined by 

AIC{\) = - log |0(A) I + Tr(Sri(A)) + 2k/n, 

where k is the number of nonzero entries in the upper diagonal part of ri(A). 

We use three examples with different covariance structures in our simu- 
lation. The first one has a tridiagonal precision matrix: 
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The second one has a dense precision matrix with exponential decay: 

where no entries are exactly zero but many are so small that penalization is 
expected to reduce the variability of the estimators. The third one has a more 
general sparse precision matrix. For each non-diagonal element uj^jji < j of 
rio, with probability 0.8 we set it to be zero, otherwise we sample a value for 
uj^j from a uniform distribution on [—1, —0.5] U [0.5, 1]. Each diagonal entry 
is set as twice of the sum of the absolute values of the corresponding row 
elements excluding the diagonal entry itself. 

For each example, we use n i.i.d. random vectors generated from a multi- 
variate Gaussian distribution A^(0, fig ^) with dimension p = 20. We consider 
both n = 50 and n = 100. The errors are calculated based on 500 simulated 
datasets in each example. To compare the performance of different selec- 



tors, we use t 



l e Fro benius norm ||r2o — ^\\f as well as the entropy loss 



( lYuan and Lin 



20071 1 defined by 



Entropy(Oo, O) = - log iri^ + Tr(rio ^fl) - p. 



For the performance in terms of sparsity of the matrix, we use false positives 
(number of zero entries in the true precision matrix identified to be nonzero) 
and false negatives (number of nonzero entries in the true precision matrix 
identified to be zero). 
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The results for the Gaussian models are summarized in Tables 1-3 for 
Lasso penalty, adaptive Lasso penalty and SCAD penalty respectively, which 
gives the average losses and the average number of false positives and neg- 
atives for each case together with the corresponding standard errors. For 
sparse precision matrices, the BIC approach outperforms LOOCV, KCV and 
AlC in terms of the two loss measures as well as the sum of false positives 
and false negatives. For dense matrices, although the number of false neg- 
atives is generally larger compared to other selectors, which is certainly as 
expected, the performance of BIC in terms of loss is still superior. Based 
on our simulations, tuning parameter selected by AIC generally performs 
the worst. Finally, maybe surprisingly, the performance of GACV is almost 
always better than LOOCV, and in many cases also better than KCV. The 
reader should note that GACV can be computed much faster than either 
LOOCV or ten-fold CV. 

4 CONCLUDING REMARKS 

In this paper we compare several approaches for tuning parameter selection 
in penalized Gaussian graphical models. As an approximation to leave-one- 
out cross-validation, we derived generalized approximate cross-validation in 
the current context which is much faster to compute. Simulations show that 
GACV even outperforms the leave-one-out version. For model identification, 
we employ BIC for tuning parameter selection, and proved its consistency 
property. In our simulations with sparse matrices or dense matrices with 
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n=100 



n=50 
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n=100 



n=50 
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n=100 



n=50 
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CV 
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0.97 


0.94 
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Q 


Q 


Q 


Q 
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(13.74) 


(6.23) 
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(12.61) 


(12.21) 
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(9.68) 
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(0.49) 
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(5.85) 
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(11.19) 
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Table 1 : Simulation results for Lasso estimators using five different tuning param- 
eter selectors. The reported average errors are based on 500 simulated datasets. 
The numbers in the brackets are the corresponding standard errors. 
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1.40 


4.61 


1.97 


1.81 


1.54 




(0.48) 


(1.33) 


(0.36) 


(0.46) 


(0.69) 


FP 


49.11 


212.26 


94.42 


83.44 


58.26 




(28.73) 


(47.45) 


(15.51) 


(25.35) 


(38.58) 


FN 


54.34 


21.53 


45.11 


47.91 


52.72 




(11.19) 


(10.21) 


(8.82) 


(11.06) 


(11.86) 



Table 2: Simulation results for adaptive Lasso estimators using five different tun- 
ing parameter selectors. 
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Tridiagonal 
n=100 



n=50 



Dense 
n=100 



n=50 



General sparse 
n=100 



n=50 





BIC 


AIC 


CV 


KCV 


GACV 


Probenius 


5.84 


22.80 


10.29 


8.02 


6.07 




(4.06) 


(7.20) 


(3.29) 


(2.83) 


(5.13) 


Entropy 


0.90 


2.20 


1.25 


1.06 


1.10 




(0.26) 


(0 44) 


(0 32) 


(0 31) 


(0 41) 


FP 


64.08 


158.79 


99.45 


61.23 


84.93 




(20.58) 


(26.89) 


(22.97) 


(20.59) 


(26.51) 


FN 


3.78 


1.32 


3.11 


3.28 


3.31 




(2.93) 


(1 38) 


(2 94) 


Cl 58) 


(2 54) 


Frobcnius 


12.39 


62.83 


18.95 


15.38 


18.39 




(S 76) 


(36.70) 


(11 29) 


(4 24) 


(11.22) 


Entropy 


1.71 


6.55 


2.18 


1.94 


2.17 




(1 41") 


Cl 71) 


Co 76) 


(0.38) 


(1 74) 


FP 


65.48 


180.72 


97.79 


81.38 


89.23 




(23.53) 


(27.83) 


(23.24) 


(15.75) 


(25.80) 


FN 


8.15 


4.76 


4.85 


7.69 


7.84 




(4.45) 


(2.85) 


(5.66) 


(7.16) 


(4.21) 


Frobenius 


1.68 


6.75 


1.94 


1.92 


1.72 




(0.78) 


(2.32) 


(0.23) 


(0.23) 


(1.25) 


Entropy 


0.82 


2.37 


0.91 


0.89 


0.89 




(0.20) 


Co 54") 


Co 12) 


(0 12) 


(0.23) 


FP 



















(0) 


(0) 


(0) 


(0) 


(0) 


FN 


268.78 


112.72 


203.94 


202.06 


212.92 




(51.99) 


(30.07) 


(31.79) 


(21.59) 


(46.41) 


Probenius 


2.64 


29.53 


3.04 


3.34 


3.11 




(1 87) 


(10.51) 


(0.52) 


(Q 57) 


(2.30) 


Entropy 


1.23 


7.03 


1.48 


1.57 


1.37 




(0.83) 


(1.29) 


(0.19) 


(0.19) 


(1.32) 


FP 



















(0) 


(0) 


(0) 


(0) 


(0) 


FN 


244.80 


103.87 


219.14 


237.14 


221.14 




(81.30) 


(12.99) 


(32.97) 


(32.97) 


(65.51) 


Frobenius 


19.99 


72.61 


30.87 


21.63 


20.68 




(6.69) 


(26.93) 


(17.37) 


(6.90) 


(8.05) 


Entropy 


0.78 


2.41 


1.26 


0.92 


0.95 




(0.31) 


(0.39) 


(0.38) 


(0.24) 


(0.27) 


FP 


73.42 


155.89 


109.65 


81.25 


84.29 




27.41) 


(21.76) 


(21.94) 


(17.53) 


23.37) 


FN 


58.10 


26.51 


49.84 


50.07 


41.29 




(12.25) 


(9.12) 


(11.46) 


(12.54) 


(11.64) 


Frobenius 


34.58 


133.92 


45.50 


39.37 


39.85 




(25.36) 


(86.84) 


(15.84) 


(14.19) 


(22.57) 


Entropy 


1.06 


5.43 


2.04 


1.24 


1.31 




(0.45) 


(1.65) 


(0.68) 


(0.52) 


(10.41) 


FP 


67.75 


167.04 


102.08 


83.32 


79.14 




(26.64) 


(36.23) 


(23.48) 


(15.30) 


(25.94) 


FN 


53.85 


29.95 


38.51 


44.34 


44.00 




(10.29) 


(10.65) 


(11.05) 


(9.37) 


(9.25) 



Table 3: Simulation results for SCAD estimators using five different tuning 
rameter selectors. 
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many small entries, tuning parameter selected based on BIC clearly has bet- 
ter performance than all other approaches. 



APPENDIXES 

A. Derivation of GACV 

Denote the log-likelihood function by 

L(S,0) =log|0| -Tr(SO). 

In this section only, to simplify notation, the shrinkage estimator f2(A) is 
simply denoted by fl and similarly r2(~')(A) is denoted by fi^"'). Thus it is 
implicitly understood that the estimators depend on a fixed A. Let Xj = 
Xjxf be the covariance matrix based on a single observation so that S — 
J27=i '^i/n. The LOOCV score is defined by 

n n 

cv{X) = 5^log|^^(-*)|-Tr(5^x,^2(-')) = ^L(x„^2(-*)) 

i=l i 1=1 

n 

= nL(S,0) + 5^(L(X,,12(-^))-L(X,,0)) 

n 
i=l 

where we interpret dL{X.i, f2) / dfl — dL(X.i, fl) / dvec{fl) to be a p^— dimensional 
column vector of partial derivatives and similarly dCl = vec(f2*^~*-' — CI). Be- 
sides, here as well as in the following, like the definition of generalized degrees 
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dL{X„n) 

dn 



1 



dn, 



of freedom in iFan and Lil (120011 ) and iWang et al.l (120071 ). the partial deriva- 
tives corresponding to the zero elements in Q, are ignored. 



Using matrix calculus such as presented in 



Bishop! (120061 ). we have 



vec(n-^ - X,- 



and we only need to deal with the term dfl = vec(r2'-~*^ — ft). 

Denote the penalized log-likelihood based on the sufficient statistic S by 
L(S, O) = L(S, O) - Y.ijPx,,{\uJij\), Taylor expansion gives us 







dL{s^-'\n^-'^) 
~ dn ~_ 
dL{s,n) d^L{s,n] 



dn 



+ 



dO? 



dn + 



d^L{s,n) 

dftdS 



dS, 



where d'^L{S,n) /dfl'^ = d{L{S,n)/dvec{n))/dvec{n) is the x Hes- 
sian matrix, and d'^L{S,n) /dfldS is defined similarly. Like before, dfl = 
vec(ri*^~*) — fl) and dS = vec(S*^~*^ — S) actually denote their vectorized 
version. 

Since dL{S, n)/dfl = 0, it immediately follows that 



dnds 



From matrix calculus, we have (i^L(S, r2)/(ir2(iS = — Ip2xp2 and d'^L[S,Q)/dfl'^ 
-{fl^^ ® ri^^ + D) where D is a diagonal matrix with diagonal elements 
'Px, i\^ij\)- Thus we have the approximation c/O ^ [(i^L(S, f2)/c/0^] ^ dS. 
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Even for moderate p, inversion of this x matrix is computationally in- 
feasible. However, note that typically we consider only the situation with 
A = 0(1) and p-(.^(|<^ij|) = o(l) (for example, the second derivative for 
SCAD penalty function is exactly zero for nonzero elements). Thus we 
can approximate (g) + D)~^ by (f2~^ (g) f2~^)~^ = f2 (g) 12 and 

dfl pa — (f2 ® fl)dS = vec(ri ■ (S^*) — S) • O) which involves only p x p 
matrices and no inversion of matrices is required. 

B. Proof of Theorem 1 

We only need to assume the general conditions on the penalty function that 
guarantees the oracle property of the estimator with appropriately chosen 
tuning parameter. In particular, wc assume that 

Condition 1. max{|p';;^(|u;°.|) : u;^. ^ 0} ^ 0. 

Condition 2. lim inf „^oo 1™ inf a;^o+ Pa„ (^) /'^n > 0. 

Condition 3. The (theoretically) optimal tuning parameter satisfies A„ — >■ 

and \/n\n — > 00. 

For an arbitrary model S specified by the constraints that only some 
of the elements in the precision matrix can be nonzero, i.e. S C {(i,j) : 

1 < j} is the set of elements not constrained to be zero, denote by Ls 
the value of the constrained maximized likelihood with infinite data: Ls — 
maxn £'(log |r2| — Tr(Xf2)), where the maximization is performed over Q, 
with zero entries for all (i, j) G S. We partition A = [0, 00) into three parts: 
Ao = {A e A : 5a = 5r}, A_ - {A e A : 5a ^ 5t}, A+ = {A e A : 5a 2 St], 
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where Sx is the model identified by the estimator when A is used as the tuning 
parameter, and St is the true model St = {(hj) '■ ^ ^ Jj^fj 0}- We will 
prove separately that under-fitting probability and over-fitting probability 
are both negligible. 

Bounding the under-fitting probability: If Sx ^ St, we have that 

BIC{X) = -L{S,n{X)) + \Sx\^^ 

n 

> -L(s,n.j + |sy°^" 



n 



where fls is the unpenalized maximum likelihood estimator based on model 
S. The convergence above is based on the uniform convergence of the empir- 
ical distribution since the class of log-likelihood functions is Glivenko-Catelli. 
Similarly, with the optimal choice of that satisfies Condition 3, 



BIC{Xn) = -L{S,Cl{Xn)) + \SxJ^^ ^ -Ls,. 



Thus we have pr{infAeA_ BIC{X) > BIC{Xn)} 1 since —Ls^ < —Ls when 
S ^ St. 

Bounding the over- fitting probability: Now suppose A e A_|_. Since 



BIC{Xn) = -L{S,n{Xn)) + \SxJ^^ 

n 
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and 

BIC{\) = -L{S,n{\)) + \Sx\^-^, 

Th 

with |<Sa| > = 15^1 (with probability 1), we will get pr(inf;^gA+ -B/C( A) > 
BIC{Xn)) ^ 1 if it can be shown that L(S, Cl{X)) - L{S, ^2(A„)) = Op{l/n). 

We will use the usual notion for sample mean: P„L(X, Q) — Yli=i L(X.i, fl)/n — 
L(S, ft) and use PL(X., f2) to denote the corresponding population mean for 
a fixed precision matrix 17. Let G„ = ^/n{Pn — P) be the empirical process. 
We write 

L(S,f^(A))-L(S,rj(A„)) 

< P„L(X,05j-P„L(X,f2(A„)) 

= P„L(X, (is^) - PL(X, - (Pni^(X, 12(A„)) - PL(X, f2(A„))) 
+PL(X, l^^J - PL(X, ^^o) - (Pi:(X, ^^(An)) - PL(X, l^o)) , 

where f2o is the true precision matrix. Define Mi(X) = (^(X, (is^) - L(X, Oq)) , 
and M2(X) = ^-^(X, Ct{Xn)) — -L(X, f2o)^ , then the previous display can 
be written as 

L(S,0(A))-L(S,0(A„)) 

< -G'„Mi(X)--G'„M2(X) 

n n 

+ (PL(X, (is,) - PL{X, f2o)) - {PL{X, 0(A„)) - PL(X, ^^o)) 
=: (I)+(II)+(III)+(IV). 
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We first bound (I) from above. Classical maximum likelihood theory tells us 
that H„ := ^/n{{ls^ — fio) is asymptotically normal ( for the non- constraine d 



elements of the matrix). Applying Lemma 19.31 in 



van der Vaart 



fll998h . 



we have G„(Mi(X) — H^(iL(X, Qo)/dfl) and then it is ea sily seen that 



Fan et al. 



(120081 ) have shown 



G„(Mi(X)) = Op(l). Similarly, given that 
that f2(A„) is also asymptotically normal, we have GnM2(X.) = Op(l). For 
(III) and (IV), a simple Taylor expansion around O yields both term to be 
of order Op{l/n). 
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